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Abstract 

Background: A number of evolutionary models have been widely used for sequence alignment, phylogenetic tree 
reconstruction, and database searches. These models focus on how sets of independent substitutions between amino 
acids or codons derive one protein sequence from its ancestral sequence during evolution. In this paper, we regard 
the Empirical Codon Mutation (ECM) Matrix as a communication channel and compute the corresponding channel 
capacity. 

Results: The channel capacity of 4.1 875 bit, which is needed to preserve the information determined by the amino 
acid distribution, is obtained with an exponential factor of 0.26 applied to the ECM matrix. Additionally, we have 
obtained the optimum capacity achieving codon distribution. Compared to the biological distribution, there is an 
obvious difference, however, the distribution among synonymous codons is preserved. More importantly, the results 
show that the biological codon distribution allows for a "transmission" at a rate very close to the capacity. 

Conclusion: We computed an exponential factor for the ECM matrix that would still allow for preserving the genetic 
information given the redundancy that is present in the codon-to-amino acid mapping. This gives an insight how 
such a mutation matrix relates to the preservation of a species in an information-theoretic sense. 



Background 

Markov models for the protein sequence evolution have 
been widely used for the past 40 years. These evolutionary 
matrices highlight the most common mutational changes 
between amino acids and codons. Protein sequence evo- 
lution has been investigated at both amino acid and codon 
levels. The evolutionary matrices on the basis of amino 
acids are widely used for sequence alignments and phylo- 
genetic tree construction. As more than one codon encode 
to the same amino acid, it is easy to estimate alignments 
in amino acids as compared to codons. 

Codon-level models demonstrate the mutational changes 
among the codons. This gives us more information by 
highlighting the tendency of mutations between codons 
encoding the same amino acid (synonymous changes) as 
well as the mutational effects between codons that code 
for different amino acids (non-synonymous changes). 
As codons are the smallest genetic information unit in 
protein-encoding regions, it is obvious to model mutations 
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by a codon-based communications channel model high- 
lighting all codon-to-codon changes in nature. 

Substitution matrices define the rate at which one amino 
acid in the protein sequence is changed into another 
amino acid. Dayhoff et al. [1] estimated the first such 
model in 1972, resulting in the widely used point accepted 
mutations (PAM) matrix. It is computed by counting the 
mutations in the closely related proteins. Henikoff and 
Henikoff proposed the block substitution matrix (BLOSUM) 
for divergent protein sequences, which uses log-likelihood 
ratios to construct scoring matrices from the transition 
matrices between amino acids [2]. Later on, Whelan and 
Goldman (WAG) proposed a novel approach to estimate 
amino acid replacement matrices from a large database 
of aligned protein sequences in 2001 [3]. It combines 
the estimation of transition and scoring matrices by a 
maximum-likelihood approach that accounts for the phy- 
logenies of sequences within each training alignment. 

As the codon (a tri-nucleotide) is the basic genetic 
information that directly encodes the amino acid as the 
building block of proteins, we have used the first empiri- 
cal codon substitution matrix (ECM) in our analysis. This 
was proposed by Schneider et al., where sequences of five 
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vertebrates were aligned and the number of codon sub- 
stitutions were counted among them [4]. According to 
conversations with the authors, it is estimated that these 
mutations on average happened in roughly 300 Million 
years. 

Yockey was one of the first to model and describe a 
central dogma using information theoretic tools [5]. He 
viewed the flow of information from DNA or RNA to pro- 
teins as a communication system and employed entropy, 
rate, and capacity calculations with a transition matrix he 
developed by considering base changes of equal probabil- 
ity. A detailed analysis of the application of information 
theory to molecular biology can be found in his book 
[6]. Relatively recently, L. Gong, N. Bouaynaya, and D. 
Schonfeld have proposed a communication model for pro- 
tein evolution [7]. They used the amino acid based PAM 
matrix and a matrix they produced, similar to Yockey s, as 
a communication channel and performed capacity calcu- 
lations over it. 

We computed an exponential factor for the ECM matrix 
that would still allow for preserving the genetic informa- 
tion given the redundancy that is present in the codon-to- 
amino acid mapping. This gives an insight on how such a 
mutation matrix relates to the preservation of a species in 
an information-theoretic sense. 

For the underlying capacity computation, we used the 
Arimoto-Blahut algorithm [8,9] to determine the input 
distribution that maximizes the mutual information. 

Methods 

In order to compute the mutation probability in the ECM 
matrix, 17502 alignments of sequences from five verte- 
brate genomes yielded 8.3 million aligned codons from 
which the number of substitutions between codons were 
counted. This matrix has 64 x 64 entries stating the muta- 
tion probability of each codon to every other codon. Basi- 
cally, the substitution from sense codons to stop codons 
is not included in the ECM matrix, which makes the 
matrix block diagonal with a 61 x 61 matrix for cod- 
ing codons and a 3 x 3 entries for substitutions between 
stop codons. Therefore, we will consider only substitu- 
tions between coding codons and regard the ECM matrix 
as 61 X 61. From the communication perspective, this 
mutation matrix describes channel transition probabilities 

There is also another matrix in [4], which gives the 
actual count of substitutions observed. From this substi- 
tution count matrix C, we obtained the biological proba- 
bility distribution of the codons as 

Px = • (1) 

2^ 2^ ^ij 

i J 



Thereafter, we combined the codons which encode for 
the same amino acid and computed the probability distri- 
bution of amino acids, denoted p^. Using this distribution, 
the to be preserved information content of the 64 codons 
representing the 20 amino acids can be computed as 

20 

R20 = -J2 log2(P^(0). (2) 

1=1 

According to Shannon's channel coding theorem, a 
communication through a noisy channel of capacity C 
at an information rate of R is possible with an arbitrar- 
ily small probability of error, if R < C [10]. Hence, the 
channel capacity has to, at least, exceed the value of i?20- 

In communication systems, the channel capacity is 
determined by maximizing the mutual information 
between input (X) and output (Y) over the input probabil- 
ity distribution p;,^- 

C = supp/(X;r). (3) 

I(X; Y) is the mutual information which measures the 
mutual dependence between input and output distribu- 
tions, and is defined as 

I(X;Y)=H(Y)-H(Y\Xl (4) 

where H(Y) is the entropy of the codon distribution at 
the output of the ECM "channel", and H(Y\X) is the 
conditional entropy, called prevarication or irrelevance. 

However, in the system we are considering, the input 
distribution (i.e., probability distribution of codons) is 
not something to adjust. It is defined by nature. There- 
fore, we determine the channel capacity corresponding 
to the mutation "channel" matrix for a biological codon 
frequency obtained by Eq. (1). H(Y) is computed as 

61 

H(Y) = -J2PyMM> (5) 

i=l 

where py. is the output probability distribution of the i^^ 
codon. The conditional entropy H(Y\X) between input 
and output distribution of codons is computed as 

61 61 

H(Y\X) = - Y^Pi^i) E^O^/I^^') \og2Piyj\Xi), (6) 

i=l j=l 

p{yj\xi) is the conditional probability between codons, 
which is given by the empirical codon mutation (ECM) 
matrix. 

We now compute, what exponent of the ECM matrix 
would be needed to make the capacity just match the 
required rate obtained by Eq. (2). Hereto, we use the 
singular value decomposition (SVD) yielding 

[Piy\x)f = U(E)^V*, (7) 
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where U, V are unitary matrices, E is a diagonal matrix 
with nonnegative real numbers in the diagonal, and F is 
an exponent to be fine-tuned. The value of the exponent 
is changed in steps from zero to one. A value of 1 means 
the original ECM matrix is used. 

Moreover, we would like to find the optimum codon 
distribution by solving Eq. (3) and compare it with the 
biological distribution. For solving the optimization prob- 
lem, the Arimoto-Blahut algorithm was employed [8,9]. 
The Arimoto-Blahut algorithm is an iterative numerical 
algorithm that monotonically converges to the capacity 
value. To compute the capacity, it is starting from any arbi- 
trary input probability distribution p^^ (usually uniform) 
and performs the following two steps until the algorithm 
converges. 

1. Compute a quantity related to the mutual 
information per input symbol 

This results from a Lagrange multiplier step in [9]. 

2. Update the input probability distribution according 
to 



p(xj) 



p(xj)c(xj) 



(9) 



The termination criteria is based on the lower and upper 
bounds of the channel capacity. 



log 



K j } 



C < log ^max c(xj)j . (10) 



The iterations are terminated when the upper and lower 
bounds are equal up to a certain accuracy. 

Once the optimized codon distribution is obtained 
using the Arimoto-Blahut algorithm, to note the similarity 
with the biological distribution, we applied the so called 
Kullback-Leibler divergence (D/a) [H]- ^/a is a quantita- 
tive measure of how similar a probability distribution P is 
to a model distribution Q, and is defined as 



(11) 



Dkl is non-negative and gives a zero result when the 
distributions are perfectly matched. Technically speaking, 
Dkl measures the average number of extra bits required 
(coding penalty) for using a code based on Q instead of P, 

Results and discussion 

The to be preserved information content of the amino 
acids, using the amino acid distribution and computed 



according to Eq. (2) is 4.1875 bit, which is less than the 
maximum value of log2(20) = 4.3219 bit. Likewise, the 
required rate obtained by using the amino acid probabil- 
ity distribution provided by King & Jukes in [12], derived 
from 5492 residues of 53 vertebrate polypeptides is 4.2033 
bit. Thus, it is reasonable to look for a capacity that is 
at least greater than 4.1875. Hence, using the biologi- 
cal codon distribution in the five vertebrates obtained 
by using Eq. (1), we stepwise reduced the exponent of 
the ECM matrix until it satisfies the rate requirement. 
Furthermore, we used the Arimoto-Blahut algorithm to 
find the optimal input probability distribution of the 61 
codons to maximize the mutual information and compare 
it with the biological distribution of codons. The optimal 
capacity-achieving codon distribution and the observed 
biological codon distribution are both shown in Figure 1. 
The corresponding values are also tabulated in Table 1 and 
Table 2. 

The capacity obtained by optimizing the codon distri- 
bution, the mutual information based on the observed 
biological codon distribution, and the required rate are 
shown together in Figure 2. When the exponent of the 
ECM matrix is reduced, the output codon distribution 
changes and the prevarication H(Y\X) will be smaller. 
As a result, the capacity increases. The maximal expo- 
nent which satisfies the rate requirement of 4.1854 bit for 
an error-free "transmission" using the biological codon 
frequency is found to be ^ 0.26. At the same exponent, 
the optimized "channel" capacity is 4.2586 bit. It can also 
be seen that the capacity curve is very close to the one 
found by using the biological codon distribution. This 
indicates that the biological probability distribution is 
almost optimally "chosen" to achieving the capacity of the 
"channel". 

It is not surprising that the exponent is not one, since the 
matrix was obtained comparing five different vertebrate 
DNAs, the times corresponding to time spans between 40 
M - 350 M years. However, the exponent is not extremely 
small, which indicates that the matrix is at least roughly in 
agreement with information-theoretic calculations. One 
may also see this as an argument to recompute the matrix 
using the obtained coefficient. 

To see how well the biological and the optimized codon 
distributions agree, we applied the KuU-back-Leibler 
divergence (Dkl) and obtained a value of 0.0926 bit, which 
is not a very small difference (comparable with the Dkl 
of two Gaussians of equal mean and a variance differing 
by a factor of two) but still, similarities are obvious. Both 
of the probability distributions satisfy the rate require- 
ment of 4.1875 bit. In addition, the distribution among 
synonymous codons is very similar. To mention one exam- 
ple, codons encoding Alanine (A) in decreasing order of 
abundance, is GCC, GCT, GCA, and GCG, for both the 
biological and the capacity-achieving distributions. 
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Figure 1 Probability distribution of codons (Biological and Optimal). The optimum codon distribution to maximize mutual information and 
tine biological distribution of codons in the five vertebrates. Consecutive bins indicate that the codons belong to the same encoded amino acid 
(one letter symbol). The synonymous codons are arranged alphabetically. 



Conclusion 

From the so-called empirical codon substitution matrix 
(ECM), a mutation probability matrix, we derived the 
capacity when regarding the matrix as a communication 
channel We found that an exponent of 0.26 would lead to 



a capacity of 4.1875 bit that is at least required to preserve 
the genetic information represented by the 20 amino acids 
encoded by 64 codons. Additionally, for the desired chan- 
nel capacity, we have presented the optimal codon distri- 
bution found by searching the distribution that maximizes 



Table 1 Biological codon relative frequency 





Codon 


Frequency 


Codon 


Frequency 


Codon 


Frequency 


Codon 


Frequency 




T 


TTT 


0.0191 


TCT 


0.0171 


TAT 


0.0132 


TGT 


0.0110 


T 




TTC 


0.0196 


TCC 


0.0160 


TAC 


0.0160 


TGC 


0.0119 


C 






0.0085 


TCA 


0.0133 


TAA 


0.0003 


TGA 


0.0003 


A 




TTG 


0.0141 


TCG 


0.0043 


TAG 


0.0001 


TGG 


0.0125 


G 


C 


CYT 


0.0150 


CCT 


0.0176 


CAT 


0.0116 


CGT 


0.0054 


T 




CYC 


0.0173 


CCC 


0.0150 


CAC 


0.0144 


CGC 


0.0087 


C 




CTA 


0.0080 


CCA 


0.0178 


CAA 


0.0137 


CGA 


0.0062 


A 




CTG 


0.0373 


CCG 


0.0059 


CAG 


0.0337 


CGG 


0.0085 


G 


A 




0.0175 


ACT 


0.0144 


AAT 


0.0182 


AGT 


0.0136 


T 




ATC 


0.0200 


ACC 


0.0160 


AAC 


0.0206 


AGC 


0.0191 


C 




ATA 


0.0094 


ACA 


0.0169 


AAA 


0.0282 


AGA 


0.0135 


A 




ATG 


0.0219 


ACG 


0.0059 


AAG 


0.0319 


AGG 


0.0118 


G 


G 


G^ 


0.0136 


GCT 


0.0200 


GAT 


0.0252 


GGT 


0.0115 


T 




GTC 


0.0138 


GCC 


0.0213 


GAC 


0.0246 


GGC 


0.0176 


C 




GTA 


0.0084 


GCA 


0.0179 


GAA 


0.0311 


GGA 


0.0184 


A 




GTG 


0.0265 


GCG 


0.0060 


GAG 


0.0389 


GGG 


0.0133 


G 






T 




C 




A 




G 





The codon relative frequency of the five vertebrates genomes (human, mouse, chicken, frog, and zebrafish) from the data presented by Schneider A., Cannarozzi G., 
and Gonnet G. [4]. 
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Table 2 Calculated codon relative frequency 





Codon 


Frequency 


Codon 


Frequency 


Codon 


Frequency 


Codon 


Frequency 




T 


TTT 


0.0257 


TCT 


0.0113 


TAT 


0.0207 


TGT 


0.0215 


T 




TTC 


0.0264 


TCC 


0.0150 


TAC 


0.0260 


TGC 


0.0247 


C 






0.0097 


TCA 


0.0100 


TAA 




TGA 




A 




TTG 


0.0119 


TCG 


0.0066 


TAG 




TGG 


0.0439 


G 


C 


CYT 


0.0118 


CCT 


0.0159 


CAT 


0.0141 


CGT 


0.0073 


T 




CYC 


0.0150 


CCC 


0.0162 


CAC 


0.0183 


CGC 


0.0129 


C 




CTA 


0.0054 


CCA 


0.0161 


CAA 


0.0144 


CGA 


0.0077 


A 




CTG 


0.0277 


CCG 


0.0085 


CAG 


0.0337 


CGG 


0.0065 


G 


A 




0.0162 


ACT 


0.0071 


AAT 


0.0160 


AGT 


0.0130 


T 




ATC 


0.0205 


ACC 


0.0128 


AAC 


0.0212 


AGC 


0.0163 


C 




ATA 


0.0088 


ACA 


0.0093 


AAA 


0.0251 


AGA 


0.0157 


A 




ATG 


0.0330 


ACG 


0.0079 


AAG 


0.0261 


AGG 


0.0122 


G 


G 


G^ 


0.0096 


GCT 


0.0132 


GAT 


0.0234 


GGT 


0.0114 


T 




GTC 


0.0114 


GCC 


0.0172 


GAC 


0.0228 


GGC 


0.0162 


C 




GTA 


0.0060 


GCA 


0.0110 


GAA 


0.0235 


GGA 


0.0183 


A 




GTG 


0.0260 


GCG 


0.0048 


GAG 


0.0263 


GGG 


0.0126 


G 






T 




C 




A 




G 





The codon relative frequency that maximizes the mutual information between input and output and yielding a capacity close to what is required for preserving the 
information content of amino acids. An exponential factor of 0.26 is applied to the ECM matrix. 



the mutual information starting from a random initial- 
ization, A comparison of the biological distribution with 
the optimized codon distribution shows that the two dis- 
tributions are not too similar. However, the biological 
distribution is not too far from the capacity-achieving dis- 
tribution in terms of "channel" capacity, which indicates 



that the biological distribution is well "chosen". Addition- 
ally, the optimal codon distribution has preserved the 
relative abundance of synonymous codons. We concluded 
that the ECM as a channel is not too far from what would 
be expected following information theoretic arguments 
although it was derived from 5 different species. 
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Figure 2 Capacity as a function of an exponential factor. The required rate for error-free transmission and the achievable capacity are plotted as 
a function of the exponent of the ECM matrix. 
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